Data Pre Processing

Disciplined Workflow

Since data manipulation is easier in pandas, and there are multiple iterative steps ahead, we will stick to a discipline. Data Pre Processing and Data Standardization are the two main stages of data manipulation; after that, we only tune classifiers and evaluate. So for these first two stages, after loading the raw data_dict, we convert it to a pandas DataFrame, perform the operations belonging to those stages, and then, before Classifier Tuning, convert it back to my_dataset as required by our starter code.
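As a quick illustration of that round trip, here is a minimal sketch on a toy dict (not the project data):

import pandas as pd

# toy dict shaped like data_dict: {name -> {feature -> value}}
toy = {'PERSON A': {'salary': 100, 'poi': False},
       'PERSON B': {'salary': 'NaN', 'poi': True}}

# Dict to Df: keys become the row index
toy_df = pd.DataFrame.from_dict(toy, orient='index')

# ... pandas operations would go here ...

# Df to Dict: back to {index -> {column -> value}}
toy_dict = toy_df.to_dict('index')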

In [50]:
from graphviz import Digraph
g = Digraph('Work Flow',  node_attr={'shape': 'record'})
g.attr(rankdir='LR')

g.node('step_0', r'Dict To Df', style='filled', fillcolor='#BBDEFB')
g.node('step_1', r'1. Data\nPreProcessing',style='filled', fillcolor='#e6a26a')
g.node('step_2', r'2. Data\nStandardization', style='filled', fillcolor='#e6a26a')
g.node('step_2_5', r'Df to Dict', style='filled', fillcolor='#BBDEFB')

g.node('step_3', r'3. Classifier\nTuning', style='filled', fillcolor='#e6a26a')
g.node('step_4', r'4. Evaluation\nMetrics', style='filled', fillcolor='#e6a26a')

g.edges([('step_0','step_1'),('step_1','step_2'),('step_2','step_2_5'),('step_2_5','step_3'),('step_3','step_4')])
g
Out[50]:
[Rendered workflow diagram: Dict To Df → 1. Data PreProcessing → 2. Data Standardization → Df to Dict → 3. Classifier Tuning → 4. Evaluation Metrics]

Jumping into our current code setup, we would thus have something like the following to start with. Recall the functions from the earlier section; they are now simply pushed into the helper library for brevity. When we define new functions, we will likewise add them to the library before reusing them. Also note that we have switched to the Decision Tree classifier, as decided earlier.

In [51]:
from helpers_2 import init, split_data, tree_classify, evaluate
import pandas as pd

# raw data
data_dict = init()

# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index')  # 'index' means keys should be rows

#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
# 1.2. Remove Outliers

# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------

# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}

# CLASSIFIER
my_dataset = data_dict  # just for udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0)  # random_state=0 to get consistent results every time it is run

# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Udacity's Evaluation:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Accuracy: 0.69210   Precision: 0.23619  Recall: 0.24150 F1: 0.23881 F2: 0.24042
Total predictions: 10000
True positives:  483    False positives: 1562   False negatives: 1517   True negatives: 6438

1.1 Cleaning Invalid Values

Let us have a sneak peek into our raw data.

In [52]:
df.head()
Out[52]:
salary to_messages deferral_payments total_payments exercised_stock_options bonus restricted_stock shared_receipt_with_poi restricted_stock_deferred total_stock_value ... loan_advances from_messages other from_this_person_to_poi poi director_fees deferred_income long_term_incentive email_address from_poi_to_this_person
ALLEN PHILLIP K 201955 2902 2869717 4484442 1729541 4175000 126027 1407 -126027 1729541 ... NaN 2195 152 65 False NaN -3081055 304805 phillip.allen@enron.com 47
BADUM JAMES P NaN NaN 178980 182466 257817 NaN NaN NaN NaN 257817 ... NaN NaN NaN NaN False NaN NaN NaN NaN NaN
BANNANTINE JAMES M 477 566 NaN 916197 4046157 NaN 1757552 465 -560222 5243487 ... NaN 29 864523 0 False NaN -5104 NaN james.bannantine@enron.com 39
BAXTER JOHN C 267102 NaN 1295738 5634343 6680544 1200000 3942714 NaN NaN 10623258 ... NaN NaN 2660303 NaN False NaN -1386055 1586055 NaN NaN
BAY FRANKLIN R 239671 NaN 260455 827696 NaN 400000 145796 NaN -82782 63014 ... NaN NaN 69 NaN False NaN -201641 NaN frank.bay@enron.com NaN

5 rows × 21 columns

Note that invalid values are marked with the string 'NaN'. Out of all the features, every one except email_address and poi is numerical. For these numerical features, 'NaN' does not mean "not available" but rather 0. Before trying to spot outliers, we need to clean such invalid entries.
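As a quick sanity check (a small sketch on the frame built above), the 'NaN' strings also force the numeric columns to dtype object, which is another reason to replace them:

df['salary'].dtype  # dtype('O'): the 'NaN' strings keep the column from being numeric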

In [53]:
def clean_df(raw_df):
    """
    Replace 'NaN' with 0 in all feature columns except 'poi' and 'email_address'.
    Return the updated df.
    """
    # replace 'NaN' in every column except poi and email_address (ref: https://stackoverflow.com/questions/42916989/replace-missing-values-in-all-columns-except-one-in-pandas-dataframe)
    data_panda_1 = raw_df.loc[:, raw_df.columns.difference(['poi','email_address'])].replace('NaN',0, regex=True)
    # keep the two excluded columns so we can concatenate them back
    data_panda_2 = raw_df[['poi','email_address']]
    # concat column-wise (remember axis=1)
    data_panda = pd.concat([data_panda_1, data_panda_2], axis=1, sort=False)
    # fix exponential display format caused by the replace operation (ref: https://stackoverflow.com/questions/42735541/customized-float-formatting-in-a-pandas-dataframe)
    pd.options.display.float_format = lambda x : '{:.0f}'.format(x)  # we don't bother about decimal accuracy here for calculated means, stds etc.
    return data_panda

test_df = clean_df(df)  
test_df.head()
Out[53]:
bonus deferral_payments deferred_income director_fees exercised_stock_options expenses from_messages from_poi_to_this_person from_this_person_to_poi loan_advances ... other restricted_stock restricted_stock_deferred salary shared_receipt_with_poi to_messages total_payments total_stock_value poi email_address
ALLEN PHILLIP K 4175000 2869717 -3081055 0 1729541 13868 2195 47 65 0 ... 152 126027 -126027 201955 1407 2902 4484442 1729541 False phillip.allen@enron.com
BADUM JAMES P 0 178980 0 0 257817 3486 0 0 0 0 ... 0 0 0 0 0 0 182466 257817 False NaN
BANNANTINE JAMES M 0 0 -5104 0 4046157 56301 29 39 0 0 ... 864523 1757552 -560222 477 465 566 916197 5243487 False james.bannantine@enron.com
BAXTER JOHN C 1200000 1295738 -1386055 0 6680544 11200 0 0 0 0 ... 2660303 3942714 0 267102 0 0 5634343 10623258 False NaN
BAY FRANKLIN R 400000 260455 -201641 0 0 129142 0 0 0 0 ... 69 145796 -82782 239671 0 0 827696 63014 False frank.bay@enron.com

5 rows × 21 columns

Now, let us include this in our main workflow and check the scores.

In [54]:
from helpers_2 import init, split_data, tree_classify, evaluate
from helpers_2 import clean_df
import pandas as pd

# raw data
data_dict = init()

# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index')  # 'index' means keys should be rows

#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
df = clean_df(df)  
# 1.2. Remove Outliers

# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------

# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}

# CLASSIFIER
my_dataset = data_dict  # just for udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0)  # random_state=0 to get consistent results every time it is run

# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Udacity's Evaluation:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Accuracy: 0.69210   Precision: 0.23619  Recall: 0.24150 F1: 0.23881 F2: 0.24042
Total predictions: 10000
True positives:  483    False positives: 1562   False negatives: 1517   True negatives: 6438

No difference. This only means the feature extraction was already treating the 'NaN' strings as 0, or ignoring them. Nevertheless, our df is now cleaner for spotting outliers.
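For reference, Udacity's featureFormat helper (which split_data presumably wraps) coerces 'NaN' entries to 0 along roughly these lines; this is a simplified sketch, not the actual helper code:

def feature_values(data_dictionary, features):
    # simplified: build one row of feature values per person, with 'NaN' -> 0
    rows = []
    for person in data_dictionary:
        row = []
        for feature in features:
            value = data_dictionary[person][feature]
            row.append(0. if value == 'NaN' else float(value))
        rows.append(row)
    return rows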

1.2 Remove Outliers

Outliers could be of many types. For numerical features, some of them could be

  1. Those that max out too far away in either the positive or negative direction (can be spotted in a graph)
  2. Those that are out of place (for example, a value that should always be positive but is negative)

The 1st type is usually easy to detect by visualizing. However, some may require more than one iteration. That is, an outlier may become visibly obvious only after removing another that was much farther out.

The 2nd type is usually harder to detect, as it needs knowledge about the features. For example, total_stock_value is supposed to always be positive because it is the sum of stock values, which themselves cannot be negative, so a negative value in this column could be an outlier. If it is very negative it might show up in the visuals, but otherwise it might be difficult to spot.


Outliers thus, by nature, typically require iterative investigation and feature knowledge in addition to visualization. Also, we shall remove an outlier only if it is not a POI.
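For instance, a quick programmatic check for the second type on total_stock_value might look like the sketch below (using the cleaned frame from above; the actual investigation later in this section uses a column-sum comparison instead):

# rows whose total_stock_value is negative, together with their POI flag
negative_totals = df[df['total_stock_value'] < 0][['total_stock_value', 'poi']]
print negative_totals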

Outlier Iteration 1

Let us start spotting by visualizing. For brevity, the visualizing functions are abstracted away in the helper library. The outliers are annotated as well (which required deciding up front which points to annotate, after a first un-annotated pass).
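The abstracted helpers are roughly along these lines (hypothetical re-implementations for illustration only; the actual helpers_2 code may differ):

def get_feature_max(input_df, feature_list):
    # {feature -> maximum value of that feature} for the columns we want to inspect
    return {feature: input_df[feature].max() for feature in feature_list}

def get_index_label(input_df, column, value):
    # index label (person name) of the first row where `column` equals `value`
    return input_df[input_df[column] == value].index[0]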

We first try to spot the outliers of type 1 - those maxing out in the positive direction.

In [55]:
from helpers_2 import stripplots, get_feature_max

# test df for our analysis
test_df = df  

# outliers in this iteration (this list can be built after a first pass of visualizing without annotation, picking the features from the graphs)
outlier_feature_list = [
    'bonus', 'deferral_payments', 'director_fees','exercised_stock_options',
    'expenses','from_messages', 'from_poi_to_this_person', 'loan_advances',
    'long_term_incentive','other','restricted_stock','restricted_stock_deferred',
    'salary','total_payments','total_stock_value'
    ]
outlier_dict = get_feature_max(test_df, outlier_feature_list)

# plot
stripplots(test_df, outlier_dict)

Let us find out who those outliers are, and remove them first.

In [56]:
from helpers_2 import get_index_label

outlier_list = []
for k,v in outlier_dict.iteritems():
    outlier = get_index_label(test_df, k, v)
    outlier_list.append(outlier)

outlier_list = list(set(outlier_list)) # to remove duplicates
outlier_list
Out[56]:
['LAVORATO JOHN J', 'TOTAL', 'KAMINSKI WINCENTY J', 'BHATNAGAR SANJAY']

TOTAL is obviously an entry error: the TOTAL row of the source spreadsheet was mistakenly taken as a valid data point. The others are not POIs anyway, so we go ahead and remove them all.
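Before dropping, a quick sanity check that none of them is a POI (a small sketch):

print test_df.loc[outlier_list, 'poi']  # expect all False (TOTAL is not a person at all)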

In [57]:
test_df = test_df.drop(outlier_list)

Outlier Iteration 2

Now that we have removed some of the outliers, let us visualize again with the updated data frame. This time, we also inspect an outlier that maxes out in the negative direction.

In [58]:
from helpers_2 import get_feature_min

# outliers in this iteration (this list can be built after a first pass of visualizing without annotation, picking the features from the graphs)
outlier_feature_list = ['deferral_payments', 'from_messages', 'long_term_incentive' ]
outlier_dict = get_feature_max(test_df, outlier_feature_list)
outlier_dict_min = get_feature_min(test_df, ['restricted_stock_deferred'])  # negative max as seen on graph
outlier_dict.update(outlier_dict_min)

# plot
stripplots(test_df, outlier_dict)
In [59]:
outlier_list = []
for k,v in outlier_dict.iteritems():
    outlier = get_index_label(test_df, k, v)
    outlier_list.append(outlier)

#outlier_list = list(set(outlier_list)) # to remove duplicates
outlier_list
Out[59]:
['KEAN STEVEN J', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'MARTIN AMANDA K']
In [60]:
test_df = test_df.drop(outlier_list)

Outlier Iteration 3

We have almost got rid of the type 1 outliers. Type 2 was difficult (I found this one only by referencing how others approached it online). To spot type 2, let us use pandas' describe(), instead of visuals.

In [61]:
description = test_df.describe().loc[['min','max','75%','mean']]
print description 
       bonus  deferral_payments  deferred_income  director_fees  \
min        0            -102500         -3504386              0   
max  7000000            3131860                0         125034   
75%   797188               5971                0              0   
mean  616983             184554          -168401           9415   

      exercised_stock_options  expenses  from_messages  \
min                         0         0              0   
max                  34348384    228763           4343   
75%                   1624050     52905             47   
mean                  1941625     34584            199   

      from_poi_to_this_person  from_this_person_to_poi  loan_advances  \
min                         0                        0              0   
max                       305                      609       81525000   
75%                        39                       12              0   
mean                       33                       18         593659   

      long_term_incentive    other  restricted_stock  \
min                     0        0                 0   
max               3600000 10359729          14761694   
75%                361719   149831            661377   
mean               279831   234830            843709   

      restricted_stock_deferred  salary  shared_receipt_with_poi  to_messages  \
min                     -560222       0                        0            0   
max                       44093 1111258                     5521        15149   
75%                           0  263257                      862         1326   
mean                     -22440  172338                      635         1060   

      total_payments  total_stock_value  
min                0             -44093  
max        103559793           49110078  
75%          1835865            2180983  
mean         1959256            2762231  

Most of the max values found above, even if they look very large compared to the 75th percentile, can still be ignored, because we already established visually that they belong to POIs; no non-POIs stand out much.

Let us check whether total_payments and total_stock_value are properly calculated. Each is the sum of its related individual feature columns, as shown below.

[Figure: Type 2 outlier check - total_payments as the sum of the payment columns, total_stock_value as the sum of the stock columns]

Checking total payments

In [62]:
total_payments_components = ['salary', 'bonus', 'long_term_incentive', 'deferred_income', 'deferral_payments','loan_advances','other','expenses','director_fees']

# calculate sum of individual components and store in 'total_check'
total_df = test_df[total_payments_components]
total_df = total_df.assign(total_check=test_df[total_payments_components].sum(axis=1)) # thanks: https://stackoverflow.com/questions/33750326/compute-row-average-in-pandas

# compare
import numpy as np
total_df['result'] = np.where( (test_df['total_payments'] == total_df['total_check']) ,'correct','wrong')

total_df['result'].value_counts()
Out[62]:
correct    137
wrong        1
Name: result, dtype: int64
In [63]:
print get_index_label(total_df,'result','wrong')
BELFER ROBERT

Checking total stocks

In [64]:
total_stock_components = ['exercised_stock_options','restricted_stock','restricted_stock_deferred']

# calculate sum of individual components and store in 'total_check'
total_df = test_df[total_stock_components]
total_df = total_df.assign(total_check=test_df[total_stock_components].sum(axis=1)) # thanks: https://stackoverflow.com/questions/33750326/compute-row-average-in-pandas

# compare
import numpy as np
total_df['result'] = np.where( (test_df['total_stock_value'] == total_df['total_check']) ,'correct','wrong')

total_df['result'].value_counts()
Out[64]:
correct    137
wrong        1
Name: result, dtype: int64
In [65]:
print get_index_label(total_df,'result','wrong')
BELFER ROBERT

So we have a type 2 outlier here. Let us check if he is a POI.

In [66]:
def is_POI(input_df, index_label):
    """
    Given a data frame and an index label, return whether that index label is a POI (True) or not (False)
    """
    return input_df.loc[index_label,'poi']

is_POI(test_df, 'BELFER ROBERT')
Out[66]:
False

He is not a POI. Thus we could remove him as well.

In [67]:
outlier_list = ['BELFER ROBERT']
test_df = test_df.drop(outlier_list)

Outlier Summary

Through 3 iterations (2 visual, 1 computational), we have found a few outliers, which we now remove before re-checking our scores.

In [68]:
def remove_outliers(input_df, outlier_list):
    return input_df.drop(outlier_list)
In [69]:
from helpers_2 import init, split_data, tree_classify, evaluate
from helpers_2 import clean_df
import pandas as pd

# raw data
data_dict = init()

# Dict to Df
df = pd.DataFrame.from_dict(data_dict, orient='index')  # 'index' means keys should be rows

#-------------- PANDAS AREA --------------------
# 1. PRE PROCESSING
# 1.1. Cleaning Invalid Values ('NaN')
df = clean_df(df)  
# 1.2. Remove Outliers
outlier_list = ['LAVORATO JOHN J', 'TOTAL', 'KAMINSKI WINCENTY J', 'BHATNAGAR SANJAY']  # iteration 1
outlier_list += ['KEAN STEVEN J', 'DERRICK JR. JAMES V', 'FREVERT MARK A', 'MARTIN AMANDA K'] # iteration 2
outlier_list += ['BELFER ROBERT'] # iteration 3
df = remove_outliers(df, outlier_list)

# 2. STANDARDIZATION
# 2.1. Feature Scaling
# 2.2. Feature Selection
features_list = ['poi','salary']
#-------------- PANDAS AREA --------------------

# Df to Dict
data_dict = df.to_dict('index') # 'index' means convert to dict like {index -> {column -> value}}

# CLASSIFIER
my_dataset = data_dict  # just for udacity's convention
labels, features = split_data(my_dataset, features_list)
clf = tree_classify(random_state=0)  # random_state=0 to get consistent results every time it is run

# EVALUATION
trained_clf = evaluate(clf, my_dataset, features_list)
Udacity's Evaluation:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

Accuracy: 0.71656   Precision: 0.34158  Recall: 0.29700 F1: 0.31773 F2: 0.30496
Total predictions: 9000
True positives:  594    False positives: 1145   False negatives: 1406   True negatives: 5855

Though accuracy improved only slightly, precision and recall improved noticeably compared to the scores before outlier removal.

Note that we still use only one feature, salary, for classification, which could be a reason for the low scores. In the next section we will explore adding more features and scaling them.
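As a preview, scaling numeric columns inside the pandas area could look like this minimal sketch (the column choice here is purely illustrative; the actual feature selection and scaling happen in the next section):

from sklearn.preprocessing import MinMaxScaler

# scale a couple of numeric columns into [0, 1] (columns chosen only for illustration)
cols_to_scale = ['salary', 'bonus']
scaler = MinMaxScaler()
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale].astype(float))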

So far,

  1. We cleaned the data (replaced 'NaN' with 0)
  2. We removed a few outliers

This nearly concludes our Data Pre Processing.

In [70]:
from graphviz import Digraph
g = Digraph('Work Flow',  node_attr={'shape': 'record'})
g.attr(rankdir='LR')

g.node('step_0', r'Dict To Df', style='filled', fillcolor='#BBDEFB')
g.node('step_1', r'1. Data\nPreProcessing',style='filled', fillcolor='#4be00b')
g.node('step_2', r'2. Data\nStandardization', style='filled', fillcolor='#e6a26a')
g.node('step_2_5', r'Df to Dict', style='filled', fillcolor='#BBDEFB')

g.node('step_3', r'3. Classifier\nTuning', style='filled', fillcolor='#e6a26a')
g.node('step_4', r'4. Evaluation\nMetrics', style='filled', fillcolor='#e6a26a')

g.edges([('step_0','step_1'),('step_1','step_2'),('step_2','step_2_5'),('step_2_5','step_3'),('step_3','step_4')])
g
Out[70]:
[Rendered workflow diagram, same as above, with 1. Data PreProcessing now highlighted as complete]